The marketing team of an e-commerce site has launched an email campaign. The site has the email addresses of all users who have created an account in the past.
The team chose a random sample of users and emailed them, letting them know about a new feature implemented on the site. From the marketing team's perspective, a success is when a user clicks the link inside the email, which takes them to the company site.
You are in charge of figuring out how the email campaign performed and were asked the following questions:
What percentage of users opened the email and what percentage clicked on the link within the email?
The VP of marketing thinks that it is stupid to send emails to a random subset of users in a random way. Based on all the information you have about the emails that were sent, can you build a model to optimize future email campaigns and maximize the probability of users clicking on the link inside the email?
By how much do you think your model would improve the click-through rate (defined as # of users who click on the link / total users who received the email)? How would you test that?
Did you find any interesting patterns in how the email campaign performed for different segments of users? Explain.
email_id : the ID of the email that was sent; it is unique per email
email_text : there are two versions of the email: one with "long text" (4 paragraphs) and one with "short text" (2 paragraphs)
email_version : some emails were "personalized" (the opening includes the name of the recipient, e.g. "Hi John,"), while others were "generic" (the opening is just "Hi,")
hour : the user's local hour when the email was sent
weekday : the day of the week when the email was sent
user_country : the country where the user receiving the email is based, derived from the user's IP address when she created the account
user_past_purchases : how many items the user receiving the email has bought in the past
# Data manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Data Exploration
from dataprep.eda import *
# showing multiple outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
#model evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn import metrics as mt
We import the remaining required modeling and visualization libraries.
#pip install catboost
# Include local library paths
import sys
import plotly
import plotly.graph_objs as go
import xgboost as xgb
from catboost import Pool,CatBoostClassifier
# sys.path.append('path/to/local/lib') # uncomment and fill to import local libraries
# Import local libraries
We set all relevant parameters for the notebook. By convention, parameters are uppercase, while all other variables follow Python's naming guidelines.
We retrieve all the required data for the analysis.
email_open_dt = pd.read_csv('./Marketing Project/email_opened_table.csv')
email_table = pd.read_csv('./Marketing Project/email_table.csv')
link_click = pd.read_csv('./Marketing Project/link_clicked_table.csv')
email_table.head()
email_open_dt.head()
link_click.head()
| | email_id | email_text | email_version | hour | weekday | user_country | user_past_purchases |
|---|---|---|---|---|---|---|---|
| 0 | 85120 | short_email | personalized | 2 | Sunday | US | 5 |
| 1 | 966622 | long_email | personalized | 12 | Sunday | UK | 2 |
| 2 | 777221 | long_email | personalized | 11 | Wednesday | US | 2 |
| 3 | 493711 | short_email | generic | 6 | Monday | UK | 1 |
| 4 | 106887 | long_email | generic | 14 | Monday | US | 6 |
| | email_id |
|---|---|
| 0 | 284534 |
| 1 | 609056 |
| 2 | 220820 |
| 3 | 905936 |
| 4 | 164034 |
| | email_id |
|---|---|
| 0 | 609056 |
| 1 | 870980 |
| 2 | 935124 |
| 3 | 158501 |
| 4 | 177561 |
plot(email_table)
| Number of Variables | 7 |
|---|---|
| Number of Rows | 100000 |
| Missing Cells | 0 |
| Missing Cells (%) | 0.0% |
| Duplicate Rows | 0 |
| Duplicate Rows (%) | 0.0% |
| Total Size in Memory | 26.8 MB |
| Average Row Size in Memory | 281.1 B |
| user_past_purchases is skewed | Skewed |
|---|---|
| user_country has constant length 2 | Constant Length |
| user_past_purchases has 13877 (13.88%) zeros | Zeros |
email_open_dt['opened'] = 1 # emails that were opened
link_click['clicked'] = 1 # emails that were clicked
# Merging all tables
df = pd.merge(email_table, email_open_dt, on = 'email_id', how = 'left')
df = pd.merge(df, link_click, on = 'email_id', how = 'left')
df.head()
| | email_id | email_text | email_version | hour | weekday | user_country | user_past_purchases | opened | clicked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 85120 | short_email | personalized | 2 | Sunday | US | 5 | NaN | NaN |
| 1 | 966622 | long_email | personalized | 12 | Sunday | UK | 2 | 1.0 | 1.0 |
| 2 | 777221 | long_email | personalized | 11 | Wednesday | US | 2 | NaN | NaN |
| 3 | 493711 | short_email | generic | 6 | Monday | UK | 1 | NaN | NaN |
| 4 | 106887 | long_email | generic | 14 | Monday | US | 6 | NaN | NaN |
df.fillna(0, inplace = True)
In an email marketing campaign, typical performance metrics include the average email open rate, the average click-through rate (CTR), and the click-to-open rate (CTO). These metrics depend heavily on the content of the email and vary by industry: for example, a promotional email will usually see higher open and click rates than a product-introduction email. In our scenario, the campaign announces a new feature implemented on the site, which is intuitively not very attractive to users, so the expected CTR will not be very high. Let's look at some reference values below:
A high-level overview of overall email marketing statistics for 2020:
Sources: https://www.campaignmonitor.com/resources/guides/email-marketing-benchmarks/
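These three metrics can be computed directly from the merged dataframe. A minimal sketch on toy data, assuming the same `opened`/`clicked` column names as `df` above (the numbers here are illustrative, not the campaign's):

```python
import pandas as pd

# Toy stand-in for the merged dataframe (the real one has 100,000 rows)
df_toy = pd.DataFrame({
    "email_id": range(10),
    "opened":  [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "clicked": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})

open_rate = df_toy["opened"].mean()                     # opens / emails sent
ctr = df_toy["clicked"].mean()                          # clicks / emails sent
cto = df_toy["clicked"].sum() / df_toy["opened"].sum()  # clicks / opens

print(f"open rate: {open_rate:.1%}, CTR: {ctr:.1%}, CTO: {cto:.1%}")
```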
plot(df,"opened")
plot(df,"clicked")
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 1.5 MB |
| Mean | 0.1035 |
| Minimum | 0 |
| Maximum | 1 |
| Zeros | 89655 |
| Zeros (%) | 89.6% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 0 |
| Median | 0 |
| Q3 | 0 |
| 95-th Percentile | 1 |
| Maximum | 1 |
| Range | 1 |
| IQR | 0 |
| Mean | 0.1035 |
|---|---|
| Standard Deviation | 0.3045 |
| Variance | 0.09275 |
| Sum | 10345 |
| Skewness | 2.6042 |
| Kurtosis | 4.7819 |
| Coefficient of Variation | 2.9439 |
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 1.5 MB |
| Mean | 0.02119 |
| Minimum | 0 |
| Maximum | 1 |
| Zeros | 97881 |
| Zeros (%) | 97.9% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 0 |
| Median | 0 |
| Q3 | 0 |
| 95-th Percentile | 0 |
| Maximum | 1 |
| Range | 1 |
| IQR | 0 |
| Mean | 0.02119 |
|---|---|
| Standard Deviation | 0.144 |
| Variance | 0.02074 |
| Sum | 2119 |
| Skewness | 6.6493 |
| Kurtosis | 42.2137 |
| Coefficient of Variation | 6.7965 |
Comment: from the charts above, out of 100,000 sent emails, about 10.35% were opened and 2,119 recipients clicked the link, giving a CTR (click-through rate) of about 2.1% and a CTO (click-to-open rate, clicks / opens) of about 20.5%. Compared with the reference values above, these ratios indicate a good performance; the content appears to be well designed and well personalized for the readers.
In the next steps, we will dive deeper into further analyses to better understand the behavior of the email recipients.
opened_email_df = df.loc[
df['opened'] == 1
].groupby([
'email_text', 'email_version',"weekday"
]).count()['opened']
opened_email_df = opened_email_df.unstack().fillna(0)
opened_email_df
| weekday | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | Wednesday | |
|---|---|---|---|---|---|---|---|---|
| email_text | email_version | |||||||
| long_email | generic | 170 | 290 | 218 | 215 | 306 | 283 | 305 |
| personalized | 288 | 445 | 328 | 355 | 440 | 451 | 490 | |
| short_email | generic | 232 | 364 | 284 | 242 | 354 | 361 | 360 |
| personalized | 361 | 568 | 447 | 449 | 591 | 609 | 539 |
opened_email_df.loc[('Sum','Total by day'),:] = opened_email_df.sum()
opened_email_df
| weekday | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | Wednesday | |
|---|---|---|---|---|---|---|---|---|
| email_text | email_version | |||||||
| long_email | generic | 170.0 | 290.0 | 218.0 | 215.0 | 306.0 | 283.0 | 305.0 |
| personalized | 288.0 | 445.0 | 328.0 | 355.0 | 440.0 | 451.0 | 490.0 | |
| short_email | generic | 232.0 | 364.0 | 284.0 | 242.0 | 354.0 | 361.0 | 360.0 |
| personalized | 361.0 | 568.0 | 447.0 | 449.0 | 591.0 | 609.0 | 539.0 | |
| Sum | Total by day | 1051.0 | 1667.0 | 1277.0 | 1261.0 | 1691.0 | 1704.0 | 1694.0 |
Comment: people tend to open emails on weekdays rather than on the weekend.
email_df_sum = df.loc[
df['clicked'] == 1
].groupby([
'email_text', 'email_version',
]).sum()['clicked']
email_df_sum
email_df = df.loc[
df['clicked'] == 1
].groupby([
'email_text', 'email_version',"weekday"
]).count()['clicked']
email_df = email_df.unstack().fillna(0)
email_df
email_text email_version
long_email generic 346.0
personalized 586.0
short_email generic 414.0
personalized 773.0
Name: clicked, dtype: float64
| weekday | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | Wednesday | |
|---|---|---|---|---|---|---|---|---|
| email_text | email_version | |||||||
| long_email | generic | 23 | 51 | 40 | 39 | 65 | 67 | 61 |
| personalized | 68 | 97 | 64 | 60 | 93 | 89 | 115 | |
| short_email | generic | 34 | 65 | 55 | 37 | 76 | 70 | 77 |
| personalized | 74 | 116 | 101 | 105 | 115 | 126 | 136 |
email_df/opened_email_df
| weekday | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | Wednesday | |
|---|---|---|---|---|---|---|---|---|
| email_text | email_version | |||||||
| Sum | Total by day | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| long_email | generic | 0.135294 | 0.175862 | 0.183486 | 0.181395 | 0.212418 | 0.236749 | 0.200000 |
| personalized | 0.236111 | 0.217978 | 0.195122 | 0.169014 | 0.211364 | 0.197339 | 0.234694 | |
| short_email | generic | 0.146552 | 0.178571 | 0.193662 | 0.152893 | 0.214689 | 0.193906 | 0.213889 |
| personalized | 0.204986 | 0.204225 | 0.225951 | 0.233853 | 0.194585 | 0.206897 | 0.252319 |
ax = email_df.plot(
kind='bar',
figsize=(20, 10),
grid=True
#stack = True
)
ax.set_ylabel('Click Count')
plt.show()
ax_open_email = opened_email_df.plot(
kind='bar',
figsize=(20, 10),
grid=True
#stack = True
)
ax_open_email.set_ylabel('Open Count')
plt.show()
Comment: in general, open counts and click counts are highly correlated. In addition:
1. personalized emails have a higher CTO (click-to-open rate) than generic ones
2. short emails convert more clicks than long emails
3. Wednesday (no. 1), Tuesday (no. 2), and Monday (no. 3) collect more clicks than the other days, although the open counts are fairly similar across days
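Whether the personalized-vs-generic gap is statistically significant can be checked with a chi-square test on the 2x2 table of clicks vs. non-clicks per version. A hand-rolled sketch (the click totals come from the tables above; the 50,000 sends per version is an assumption based on the random assignment, and the real counts should be read from `df`):

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]
    (no continuity correction); compare against the chi2(1) critical
    value of 3.84 at the 5% level."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

clicks_personalized = 586 + 773   # long + short personalized clicks
clicks_generic = 346 + 414        # long + short generic clicks
sends_per_version = 50_000        # assumption: random 50/50 split

stat = chi2_2x2(clicks_personalized, sends_per_version - clicks_personalized,
                clicks_generic, sends_per_version - clicks_generic)
print(round(stat, 1))
```

A statistic far above 3.84 would indicate the difference is very unlikely to be due to chance.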
click_by_hour_df = df.loc[
df['clicked'] == 1
].groupby([
'email_text', 'email_version',"weekday","hour"
]).count()['clicked']
click_by_hour_df= click_by_hour_df.unstack().fillna(0)
click_by_hour_sp = click_by_hour_df.loc[('short_email','personalized'),:]
click_by_hour_sp.loc[('Total by hour'),:] = click_by_hour_sp.sum()
click_by_hour_sp
| hour | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| weekday | |||||||||||||||||||||
| Friday | 2.0 | 4.0 | 1.0 | 1.0 | 3.0 | 4.0 | 7.0 | 4.0 | 12.0 | 8.0 | ... | 1.0 | 4.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| Monday | 3.0 | 4.0 | 6.0 | 4.0 | 9.0 | 4.0 | 7.0 | 9.0 | 10.0 | 8.0 | ... | 5.0 | 7.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| Saturday | 2.0 | 5.0 | 4.0 | 7.0 | 5.0 | 7.0 | 7.0 | 13.0 | 6.0 | 7.0 | ... | 5.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| Sunday | 2.0 | 2.0 | 5.0 | 2.0 | 9.0 | 6.0 | 10.0 | 13.0 | 16.0 | 6.0 | ... | 2.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| Thursday | 0.0 | 2.0 | 7.0 | 5.0 | 2.0 | 6.0 | 9.0 | 6.0 | 14.0 | 11.0 | ... | 4.0 | 4.0 | 4.0 | 1.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Tuesday | 4.0 | 6.0 | 2.0 | 5.0 | 1.0 | 8.0 | 12.0 | 10.0 | 25.0 | 12.0 | ... | 7.0 | 8.0 | 1.0 | 2.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Wednesday | 5.0 | 3.0 | 1.0 | 7.0 | 13.0 | 8.0 | 6.0 | 9.0 | 14.0 | 19.0 | ... | 8.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Total by hour | 18.0 | 26.0 | 26.0 | 31.0 | 42.0 | 43.0 | 58.0 | 64.0 | 97.0 | 71.0 | ... | 32.0 | 30.0 | 10.0 | 7.0 | 5.0 | 5.0 | 1.0 | 2.0 | 2.0 | 0.0 |
8 rows × 24 columns
import plotly.express as px
fig = px.pie(pd.DataFrame(click_by_hour_sp.loc["Total by hour"]).reset_index(),
             values='Total by hour', names='hour',
             title='Clicks by hour (short, personalized emails)')
fig.show()
Comment: recipients tend to click the link around the start of the working day, from 8 to 11 am.
email_past_purchases_df = df.loc[
df['clicked'] == 1
].groupby([
'user_past_purchases'
]).count()['clicked'].reset_index()
fig = px.line(email_past_purchases_df, x="user_past_purchases", y="clicked", title='User Past Purchases and Click Relationship')
fig.show()
# total users per past-purchase count, matched by purchase value (plain index
# alignment could silently mismatch rows, since not every purchase count has a click)
totals = df.groupby('user_past_purchases').size()
email_past_purchases_df["total user"] = email_past_purchases_df["user_past_purchases"].map(totals)
email_past_purchases_df["clicked over total ratio"] = email_past_purchases_df.clicked/email_past_purchases_df["total user"]
email_past_purchases_df
| | user_past_purchases | clicked | total user | clicked over total ratio |
|---|---|---|---|---|
| 0 | 0 | 7 | 13877 | 0.000504 |
| 1 | 1 | 154 | 13751 | 0.011199 |
| 2 | 2 | 200 | 13036 | 0.015342 |
| 3 | 3 | 200 | 12077 | 0.016560 |
| 4 | 4 | 230 | 10743 | 0.021409 |
| 5 | 5 | 201 | 9042 | 0.022230 |
| 6 | 6 | 241 | 7518 | 0.032056 |
| 7 | 7 | 186 | 6051 | 0.030739 |
| 8 | 8 | 174 | 4393 | 0.039608 |
| 9 | 9 | 150 | 3296 | 0.045510 |
| 10 | 10 | 110 | 2363 | 0.046551 |
| 11 | 11 | 87 | 1553 | 0.056021 |
| 12 | 12 | 62 | 944 | 0.065678 |
| 13 | 13 | 38 | 578 | 0.065744 |
| 14 | 14 | 33 | 362 | 0.091160 |
| 15 | 15 | 22 | 188 | 0.117021 |
| 16 | 16 | 12 | 102 | 0.117647 |
| 17 | 17 | 5 | 60 | 0.083333 |
| 18 | 18 | 1 | 35 | 0.028571 |
| 19 | 19 | 3 | 15 | 0.200000 |
| 20 | 21 | 2 | 11 | 0.181818 |
| 21 | 22 | 1 | 4 | 0.250000 |
Comment: the more past purchases a user has, the higher the click ratio. These users have likely become loyal customers of the site, so they are more interested in staying up to date. Customers with few purchases tend to ignore the email; indeed, they are the group most likely to switch to competitors.
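One way to make this pattern usable for targeting is to bucket `user_past_purchases` and compare CTR per bucket. A sketch on toy data (column names match the merged dataframe; the bucket edges are an illustrative choice, not from the original analysis):

```python
import pandas as pd

# Toy stand-in for the merged dataframe
toy = pd.DataFrame({
    "user_past_purchases": [0, 0, 1, 3, 5, 7, 9, 12, 15, 20],
    "clicked":             [0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
})

# Buckets: 0, 1-4, 5-8, 9+ past purchases
toy["bucket"] = pd.cut(toy["user_past_purchases"],
                       bins=[-1, 0, 4, 8, 100],
                       labels=["0", "1-4", "5-8", "9+"])
ctr_by_bucket = toy.groupby("bucket", observed=True)["clicked"].mean()
print(ctr_by_bucket)
```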
From the analysis above, we can use the observed patterns of customer behavior to optimize the email campaign:
Techniques:
Send the emails on weekdays, between 6 and 7 am, before people start working, so the email sits near the top of the mailbox; this should increase the open rate.
Work closely with the customer service department to take care of the group of loyal customers (those with more than 8 to 10 purchases), while increasing the purchase frequency of customers with fewer transactions. The value of this action is that we can reduce the cost of the campaign by no longer sending this kind of email (announcing new site features) to customers with few purchases (email service providers may charge per email sent), and instead send them promotions and offers to make them more familiar with the product and service.
Content: the content should be short, well personalized to the reader, and have a clear call to action.
Personal thought:
Since the ultimate goal of an e-commerce site is to increase the number of transactions of the buyers as well as the number of clients (stores), it would be better to see the bigger picture by obtaining more data from the sales and customer service departments.
-> Action - data collection: combine the marketing data with sales and customer service data.
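On the question of how to verify the model's improvement: the standard design is an A/B test, sending to a model-selected group and a random control group, then comparing their CTRs with a two-proportion z-test. A stdlib-only sketch (the click counts below are hypothetical outcomes, not measured results):

```python
from math import sqrt, erf

def two_prop_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-sided two-proportion z-test with pooled standard error."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # via normal CDF
    return z, p_value

# Hypothetical outcome: random control at 2.1% CTR, model-targeted at 2.6%
z, p = two_prop_ztest(210, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A p-value below 0.05 would suggest the model's uplift is unlikely to be chance at these sample sizes.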
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.025, random_state=42)
for train_index, test_index in split.split(df.iloc[:,1:-2], df["clicked"]):
    strat_train_set = df.iloc[train_index].reset_index(drop=True)
    strat_test_set = df.iloc[test_index].reset_index(drop=True)

# split strat_train_set (not df) so that no validation rows leak from the test set
for train_index, validation_index in split.split(strat_train_set.iloc[:,1:-2], strat_train_set["clicked"]):
    strat_validation_set = strat_train_set.iloc[validation_index]
    strat_train_set = strat_train_set.iloc[train_index]
#drop the unnecessary columns
strat_train_set.drop(['email_id',"opened"], axis=1, inplace=True)
strat_validation_set.drop(['email_id',"opened"], axis=1, inplace=True)
strat_test_set.drop(['email_id',"opened"], axis=1, inplace=True)
#check the shape of each dataset
strat_train_set.shape
strat_validation_set.shape
strat_test_set.shape
(95062, 7)
(2438, 7)
(2500, 7)
Notes: the model we will employ is CatBoost, mainly because it handles categorical features natively (no manual encoding needed) and trains efficiently on GPU.
Reference Link: https://catboost.ai/
#identify the categorical features
categorical_features_indices = np.where(strat_train_set.dtypes == object)[0]  # np.object is deprecated; use the builtin
categorical_features_indices
array([0, 1, 3, 4])
Notes: since the data is imbalanced, we will put more weight on the minority class of "clicked" (the value 1). The weights are calculated as below.
#calculating the weight of each class
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(strat_train_set.iloc[:,-1]), y=strat_train_set.iloc[:,-1])
class_weights
array([ 0.51076199, 23.72990514])
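As a sanity check, these weights match sklearn's "balanced" formula, w_c = n_samples / (n_classes * n_c). A quick pure-Python verification (the class counts below are back-solved approximations for this training set, not read from the data):

```python
def balanced_weights(class_counts):
    """sklearn-style 'balanced' weights: w_c = n / (k * n_c)."""
    n = sum(class_counts.values())
    k = len(class_counts)
    return {c: n / (k * n_c) for c, n_c in class_counts.items()}

# Approximate counts for the 95,062-row training set (assumption)
w = balanced_weights({0: 93_059, 1: 2_003})
print({c: round(v, 2) for c, v in w.items()})
```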
#build the model with hyperparameters
model = CatBoostClassifier(
    iterations=200, task_type="GPU", learning_rate=0.05,
    l2_leaf_reg=1, depth=11, loss_function='Logloss',
    eval_metric='AUC', use_best_model=True, random_seed=42,
    class_weights=[0.51, 23.73], cat_features=categorical_features_indices
)
#train the model
model.fit(
    strat_train_set.iloc[:, :-1], strat_train_set.iloc[:, -1],
    cat_features=categorical_features_indices,
    eval_set=(strat_validation_set.iloc[:, :-1], strat_validation_set.iloc[:, -1]),
    plot=True
)
Warning: less than 75% gpu memory available for training. Free: 4556.1875 Total: 7979.1875
0: learn: 0.6643690 test: 0.6263420 best: 0.6263420 (0) total: 123ms remaining: 24.5s
...
163: learn: 0.7866752 test: 0.7306524 best: 0.7306524 (163) total: 17.2s remaining: 3.77s
...
199: learn: 0.7920613 test: 0.7282344 best: 0.7306524 (163) total: 20.7s remaining: 0us
bestTest = 0.7306523621
bestIteration = 163
Shrink model to first 164 iterations.
<catboost.core.CatBoostClassifier at 0x7f9db1735940>
# predict labels for the test dataset
strat_test_set["predicted_value"] = model.predict(strat_test_set.iloc[:,:-1])
strat_test_set["predicted_value"].value_counts()
0.0 1588 1.0 912 Name: predicted_value, dtype: int64
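Rather than relying on the hard 0/1 labels from `model.predict`, we could rank users by their predicted click probability and pick a send threshold that matches the campaign budget. A minimal sketch of the idea, using a scikit-learn classifier on synthetic data as a stand-in for the fitted CatBoost model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-in for the campaign features and click labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1.2).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]       # estimated P(click) per user

# email only the top 20% of users by predicted click probability
threshold = np.quantile(proba, 0.80)
send_mask = proba >= threshold
print(send_mask.sum(), "of", len(proba), "users selected")
```

The same `predict_proba` call exists on `CatBoostClassifier`, so the ranking approach carries over directly; the 20% cutoff here is an arbitrary illustration.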
#calculating the evaluation metrics
print(classification_report(strat_test_set["clicked"], strat_test_set["predicted_value"]))
confusion_matrix(strat_test_set["clicked"], strat_test_set["predicted_value"])
mt.accuracy_score(strat_test_set["clicked"], strat_test_set["predicted_value"])
mt.precision_score(strat_test_set["clicked"], strat_test_set["predicted_value"])
mt.recall_score(strat_test_set["clicked"], strat_test_set["predicted_value"])
              precision    recall  f1-score   support

         0.0       0.99      0.64      0.78      2447
         1.0       0.05      0.79      0.09        53

    accuracy                           0.65      2500
   macro avg       0.52      0.72      0.43      2500
weighted avg       0.97      0.65      0.77      2500
array([[1577,  870],
       [  11,   42]])    # rows: actual 0/1, columns: predicted 0/1
0.6476                   # accuracy
0.046052631578947366     # precision
0.7924528301886793       # recall
Comment: From the model output, if we emailed only the users predicted as "1" (likely to click) and stopped emailing those predicted as "0", the click-through rate among targeted users would be about 4.6% (the precision on the positive class), roughly double the baseline CTR of 2.1% (53 clicks out of 2,500 emails), while still reaching about 79% of the users who would have clicked (the recall).
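The uplift claim above can be verified directly from the confusion-matrix counts (TN = 1577, FP = 870, FN = 11, TP = 42):

```python
# confusion-matrix counts from the test set above
tn, fp, fn, tp = 1577, 870, 11, 42

total = tn + fp + fn + tp
baseline_ctr = (tp + fn) / total      # CTR if everyone is emailed
targeted_ctr = tp / (tp + fp)         # CTR among users the model flags
coverage     = tp / (tp + fn)         # share of clickers still reached

print(f"baseline CTR: {baseline_ctr:.3f}")   # 0.021
print(f"targeted CTR: {targeted_ctr:.3f}")   # 0.046
print(f"recall:       {coverage:.3f}")       # 0.792
```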
feature_score = pd.DataFrame(list(zip(df.iloc[:,1:-2].dtypes.index, model.get_feature_importance(Pool(df.iloc[:,1:-2], label= df["clicked"], cat_features=categorical_features_indices)))),
columns=['Feature','Score'])
feature_score = feature_score.sort_values(by='Score', ascending=False, inplace=False, kind='quicksort', na_position='last')
#visualize feature score
plt.rcParams["figure.figsize"] = (12,7)
ax = feature_score.plot('Feature', 'Score', kind='bar', color='c')
ax.set_title("Catboost Feature Importance Ranking", fontsize = 14)
ax.set_xlabel('')
rects = ax.patches
# get feature score as labels round to 2 decimal
labels = feature_score['Score'].round(2)
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 0.35, label, ha='center', va='bottom')
plt.show()
[Bar chart: "Catboost Feature Importance Ranking" — feature scores: 54.27, 16.46, 12.7, 9.18, 4.3, 3.1]
Comment: It can be seen that among the predictor variables, user_past_purchases and weekday have the largest impact on the click rate; those variables should therefore be prioritized when designing A/B tests.
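One quick way to surface the user_past_purchases pattern is to compare CTR across purchase segments. A minimal sketch on a tiny synthetic frame (the real analysis would use the campaign DataFrame `df`, and the 8-purchase split below is an assumed threshold for illustration):

```python
import pandas as pd

# hypothetical mini-frame with the same columns as the campaign data
df = pd.DataFrame({
    "user_past_purchases": [0, 1, 3, 5, 7, 9, 10, 12, 15, 20],
    "clicked":             [0, 0, 0, 0, 1, 0, 1,  1,  1,  1],
})

# split users at an assumed 8-purchase threshold
df["segment"] = pd.cut(df["user_past_purchases"],
                       bins=[-1, 7, df["user_past_purchases"].max()],
                       labels=["<8 purchases", ">=8 purchases"])

ctr_by_segment = df.groupby("segment", observed=True)["clicked"].mean()
print(ctr_by_segment)
```

The same groupby works for `weekday` or `hour`, which makes it easy to check each of the high-importance features from the ranking above.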
Next step: Using the insights discovered above, we could strategically target users by segment (e.g. users with fewer than 8 past purchases vs. 8 or more). On top of that, personalizing the email and tailoring its content (short vs. long text) are also worth including in future A/B tests.
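To answer "how would you test that?", the A/B experiment can be sized with a standard two-proportion normal approximation. A back-of-the-envelope sketch, assuming the control arm is emailed at random (CTR ≈ 2.1%) and the treatment arm is model-targeted (CTR ≈ 4.6%, the precision observed on the test set):

```python
import math

p1, p2 = 0.021, 0.046   # assumed control vs. treatment CTR
z_a = 1.9600            # z for two-sided alpha = 0.05
z_b = 0.8416            # z for 80% power

# required emails per arm to detect the p1 -> p2 lift
p_bar = (p1 + p2) / 2
n = ((z_a * math.sqrt(2 * p_bar * (1 - p_bar))
      + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
     / (p2 - p1) ** 2)
n = math.ceil(n)
print(n, "emails needed per arm")
```

Around 800 emails per arm suffice for a lift this large; in practice we would also randomize the send and hold the email copy fixed across arms so that only the targeting differs.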
We report here relevant references: